With the database set up in SQL, let's fetch the data with SQL queries in Python and analyze it to extract useful business insights.
Below are the libraries we will use to analyze the Airbnb data:
# Data wrangling
import pandas as pd
import numpy as np
import re
# Connect MySQL
import mysql.connector
# Data Visualization
import plotly.express as px
# Draw map
import folium
from folium.plugins import MarkerCluster
%matplotlib inline
First of all, we need to create a connection to our database.
# Example of a remote connection:
# conn = mysql.connector.connect(
#     host = "db4free.net",
#     port = 3306,
#     user = "user",
#     password = "password!",
#     database = "database-name"
# )
conn = mysql.connector.connect(
    host = "localhost",
    port = 3306,
    user = "root",
    password = "",
    database = "airbnb_data"
)
Let's check the available tables in our database.
pd.read_sql_query("SHOW TABLES", conn)
# number of listings
number_listing = pd.read_sql_query("SELECT COUNT(*) AS freq FROM listing", conn)
number_listing['freq'][0]
# number of hosts
number_host = pd.read_sql_query("SELECT COUNT(*) AS freq FROM host_info", conn)
number_host['freq'][0]
# number of reviews
number_review = pd.read_sql_query("SELECT SUM(number_of_reviews) as num_review FROM listing", conn)
number_review
For the first set of business insights, we will dig into the price and property variables to understand the correlation between them and surface hidden information. Here are the business questions we want to answer:
The first business question concerns the price distribution by geographical location. Since this is best shown on a map, we will use the folium package to plot the longitude and latitude in the data and create a popup that displays the details of each point.
query = '''
SELECT name, SUM(price) AS price, longitude, latitude
FROM listing
GROUP BY name, longitude, latitude
'''
df_query = pd.read_sql_query(query,conn)
df_query.head()
df_query.isna().sum()
map_sg = folium.Map(location=[df_query.latitude.mean(), df_query.longitude.mean()], zoom_start=5)
marker_cluster = MarkerCluster().add_to(map_sg)
for name, lat, lon, price in \
        zip(df_query['name'], df_query['latitude'], df_query['longitude'], df_query['price']):
    popup = folium.Html("<b>" + name + "</b><br>" +
                        "Price: {:,}".format(price) + "<br>", script=True)
    popup = folium.Popup(popup, max_width=2650)
    folium.Marker(location=[lat, lon],
                  popup=popup
                  ).add_to(marker_cluster)
map_sg
The second business question is: what is the most common property type in the data?
We can use SQL's COUNT and GROUP BY to calculate how often each property type appears in the data. We then order the result with ORDER BY and keep only the top 10 rows with LIMIT.
query = '''
SELECT property_type, COUNT(property_type) AS frequency
FROM listing
GROUP BY property_type
ORDER BY frequency DESC
LIMIT 10
'''
df_query = pd.read_sql_query(query,conn)
df_query.head()
fig = px.bar(df_query, x='frequency', y='property_type',
title="Most Common Property Type in Listings",
orientation='h',
labels={
'property_type' : 'Property Type',
'frequency' : 'Frequency'
})
fig.show()
The visualization above shows that a private room in an apartment is the most common property type among the Airbnb listings, while a private room in a townhouse is the least common of the ten shown.
The third question is: what are the top and bottom 10 property types by average price?
If we simply compute the average price without considering how often each property type appears, the result can be misleading. Why? Because some property types are luxurious and high-priced, but only a few listings (sometimes just one) offer them. So we filter for property types that appear at least 20 times in the data.
from plotly.subplots import make_subplots
import plotly.graph_objects as go
query = '''
SELECT property_type, average_price, frequency
FROM
( SELECT property_type, AVG(price) AS average_price, COUNT(*) as frequency
FROM listing
GROUP BY property_type
) as new_data
WHERE frequency >=20
ORDER BY average_price DESC
LIMIT 10
'''
df_query = pd.read_sql_query(query,conn)
df_query = df_query.sort_values(by='average_price', ascending=True)
df_query
plot1 = go.Bar(x=df_query['average_price'],
y=df_query['property_type'],
name='Top 10',
orientation='h',
marker=dict(
color = 'blue'
)
)
query = '''
SELECT property_type, average_price, frequency
FROM
( SELECT property_type, AVG(price) AS average_price, COUNT(*) as frequency
FROM listing
GROUP BY property_type
) as new_data
WHERE frequency >=20
ORDER BY average_price
LIMIT 10
'''
df_query = pd.read_sql_query(query,conn)
df_query.sort_values(by='average_price', inplace=True)
plot2 = go.Bar(x=df_query['average_price'], y=df_query['property_type'],
orientation='h',
name='Bottom 10',
marker=dict(
color = 'red'
)
)
sub = make_subplots(rows=2, cols=1)
sub.append_trace(plot1, 1,1)
sub.append_trace(plot2, 2, 1)
sub.update_layout(height=600, width=800, title_text="Top and Bottom 10 Property based on Average Price",xaxis2=dict(range=[0,270]))
Entire serviced apartment tops the list with the highest average price, followed by Entire house and Room in boutique hotel in the top 10. Meanwhile, Shared room in bed and breakfast has the lowest average price.
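As an aside, the derived-table filter used in the queries above can be written more compactly with HAVING, which filters groups directly. Here is a self-contained sketch using an in-memory SQLite database; the table and column names mirror the queries above, but the rows are invented for illustration:

```python
import sqlite3
import pandas as pd

# In-memory database with made-up listings (not the real Airbnb table)
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE listing (property_type TEXT, price REAL)")
rows = ([("Entire condo", 100)] * 25
        + [("Private room in villa", 900)] * 3   # rare but expensive
        + [("Entire home", 150)] * 30)
conn_demo.executemany("INSERT INTO listing VALUES (?, ?)", rows)

# HAVING filters on the aggregate, so no subquery/derived table is needed
query = """
SELECT property_type, AVG(price) AS average_price, COUNT(*) AS frequency
FROM listing
GROUP BY property_type
HAVING COUNT(*) >= 20
ORDER BY average_price DESC
"""
df_demo = pd.read_sql_query(query, conn_demo)
print(df_demo)
# The rare "Private room in villa" group (3 rows) is filtered out
```

Both forms return the same result here; the derived-table version in the main text has the advantage of letting you reference the alias `average_price` in the outer WHERE.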
The information above covers the top and bottom property types by average price, but what about room type?
Using the same technique, let's uncover the relationship between room type and average price.
query = '''
SELECT room_type, average_price, frequency
FROM
( SELECT room_type, AVG(price) AS average_price, COUNT(*) as frequency
FROM listing
GROUP BY room_type
) as new_data
WHERE frequency >=20
ORDER BY average_price DESC
'''
df_query = pd.read_sql_query(query,conn)
fig = go.Figure(data=[go.Pie(labels=df_query['room_type'], values=df_query['average_price'], hole=.4)])
fig.update_traces(marker=dict(colors=['#e57373', '#f06292', '#4db6ac', '#81c784']))
fig.update_layout(
title_text="Room Type by Average Price")
fig.show()
Unlike the property type column, which has many kinds of values, room type has only 4 unique values: Hotel room, Entire home/apt, Private room, and Shared room. Entire home/apt is the most commonly offered room type in the Airbnb data, while Shared room is the least offered.
For the last business question about price, let's find out whether amenities correlate with price.
query = '''
SELECT amenities, price
FROM listing
'''
df_query = pd.read_sql_query(query, conn)
Most common amenities
First, we need to build a dataframe that lists each amenity and its frequency.
most_amenities = df_query['amenities'].str.split(', ', expand=True).stack().value_counts().to_frame("Total").head(10)
most_amenities = most_amenities.sort_values(by='Total', ascending=True)
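To see what the chained split / stack / value_counts call does, here is a toy example with made-up amenity strings:

```python
import pandas as pd

# Made-up amenity strings, one comma-separated list per listing
s = pd.Series(["Wifi, Kitchen", "Wifi, Washer", "Wifi"])

# split(expand=True) makes one column per amenity, stack() flattens the
# columns into a single Series (dropping the NaN gaps), and
# value_counts() tallies each individual amenity
counts = s.str.split(', ', expand=True).stack().value_counts()
print(counts)
# Wifi appears 3 times, Kitchen and Washer once each
```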
fig = px.bar(most_amenities, x='Total', y=most_amenities.index,
title="Top Amenities",
orientation='h',
labels={
'index' : 'Amenities',
'Total' : 'Frequency'
})
fig.show()
Almost every listing has Air conditioning and Long term stays allowed, followed by Essentials, Wifi, and Washer. More than 60% of all listings also have Dedicated workspace and Kitchen, which may require more space outside the bedroom.
Let's check whether the number of amenities correlates with price. Intuitively, listings with more complete amenities should command higher prices.
df_query['count_amenities'] = df_query.amenities.str.strip().str.split(',').apply(len)
df_query.head()
px.scatter(df_query,
x = np.log10(df_query['price']),
y = df_query['count_amenities'],
title='Correlation Between Amenities and Price',
opacity=0.5,
labels={
'x' : 'Price',
'count_amenities' : 'Count Amenities'
})
Let's check whether higher-priced listings also deliver a better overall experience, using a simple scatterplot. We scale the listing price to log10 units, keep only listings with a non-NULL review score rating, and filter for listings with more than 10 reviews.
# Collect Required Columns
query = """
SELECT price, review_scores_rating
FROM listing
WHERE review_scores_rating IS NOT NULL AND number_of_reviews > 10
"""
df_query = pd.read_sql_query(query, conn)
px.scatter(df_query,
x=np.log10(df_query.price),y='review_scores_rating',
title='Correlation Between Price and Overall Experience',
labels={
'x' : 'Price',
'review_scores_rating' : 'Scores Rating'
},
opacity=0.5
)
We will continue answering the research question by looking at the host data.
We will see who is the top host based on the total earning cumulated from his/her listings. We will use the following formula to calculate the total earning:
$$ \text{Total earning} = \sum_{i=1}^{n} \text{price}_i \times \text{number of reviews}_i \times \text{minimum nights}_i $$

Notes:
Unfortunately, we don't have detailed data on the number of nights each customer stayed, so we use minimum nights as the length of stay; what we actually compute is therefore a lower bound on the total earning. We use the number of reviews as a proxy for the number of customers.
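The formula can also be cross-checked in pandas on a small made-up example; the column names mirror the listing table, but the numbers are invented:

```python
import pandas as pd

# Invented data: two hosts, three listings
df = pd.DataFrame({
    "host_id": [1, 1, 2],
    "price": [100.0, 50.0, 200.0],
    "number_of_reviews": [10, 4, 2],
    "minimum_nights": [2, 1, 3],
})

# total earning per listing = price * number_of_reviews * minimum_nights
df["total_earning"] = df["price"] * df["number_of_reviews"] * df["minimum_nights"]

# aggregate per host, mirroring the SQL GROUP BY below
per_host = df.groupby("host_id").agg(
    total_earning=("total_earning", "sum"),
    number_of_listing=("host_id", "size"),
    average_price=("price", "mean"),
)
print(per_host)
# host 1: 100*10*2 + 50*4*1 = 2200; host 2: 200*2*3 = 1200
```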
query = """
SELECT host_id, host_name, SUM(total_earning) as total_earning, COUNT(*) as number_of_listing, AVG(price) as average_price
FROM
(SELECT listing.host_id, host_info.host_name, price, number_of_reviews, minimum_nights,
price * number_of_reviews * minimum_nights as total_earning
FROM listing
LEFT JOIN host_info
ON listing.host_id = host_info.host_id
WHERE host_name IS NOT NULL
) as new_table
GROUP BY host_id, host_name
ORDER BY SUM(total_earning) DESC
"""
df_query = pd.read_sql(query, conn)
df_query.head()
# Get top 15 host by number of listing
df_query.sort_values("number_of_listing", ascending= False, inplace = True)
df_viz = df_query.head(15).copy()
df_viz.sort_values("number_of_listing", inplace = True)
fig = px.bar(df_viz, x='number_of_listing', y='host_name',
title="Top Host by Number of Listing",
orientation='h',
labels={
'number_of_listing' : 'Number of Listing',
'host_name' : 'Host Name'
})
fig.show()
Only a handful of hosts have more than 5 listings. Next, we look at the top 15 hosts by the total earning generated.
# Get top 15 host by total_earning
df_query.sort_values("total_earning", ascending= False, inplace = True)
df_viz = df_query.head(15).copy()
df_viz.sort_values("total_earning", inplace = True)
fig = px.bar(df_viz, x='total_earning', y='host_name',
title="Top Host by Total Earning",
orientation='h',
labels={
'total_earning' : 'Total Earning',
'host_name' : 'Host Name'
})
fig.show()
The list of hosts differs between the top hosts by number of listings and the top hosts by total earning. This indicates that having more listings doesn't guarantee higher earnings.
Does a higher average listing price for a single host correlate with a higher total earning? Let's answer this question with a scatterplot.
fig = px.scatter(df_query, x = 'average_price', y = 'total_earning',
title="Average Price vs Total Earning",
opacity=0.5,
labels={
'average_price' : 'Average Price',
'total_earning' : 'Total Earning'
})
fig.show()
As we can see, there is no visible pattern between average price and total earning. In fact, some of the highest total earnings are generated by hosts with relatively low average prices across their listings. Therefore, a higher average price doesn't guarantee a host a higher total earning.
According to Airbnb, superhosts are experienced hosts who provide a shining example for other hosts, and extraordinary experiences for their guests.
Here we are interested in whether the distribution of review scores differs between listings owned by a superhost and those owned by a normal host.
query = """
SELECT review_scores_rating, listing.host_id, host_info.host_name, host_info.host_is_superhost
FROM listing
LEFT JOIN host_info
ON listing.host_id = host_info.host_id
WHERE review_scores_rating IS NOT NULL and host_is_superhost IS NOT NULL
"""
df_query = pd.read_sql(query, conn)
# Transform superhost flag to Boolean
# (assumes the column is stored as 0/1; 't'/'f' strings would need an
# explicit map, since astype('bool') turns any non-empty string into True)
df_query.host_is_superhost = df_query.host_is_superhost.astype('bool')
df_query
fig = px.histogram(df_query, x='review_scores_rating', color='host_is_superhost',
title='Review Scores Rating',
labels={
'review_scores_rating' : 'Scores Rating',
'count' : 'Count'
})
fig.show()
Both superhosts and normal hosts peak at the same point in the distribution, indicating no large difference in the average review score or overall experience. However, the superhost distribution is narrower, suggesting superhosts deliver consistently high review scores with less variability.
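One way to back up the variability claim is to compare the group standard deviations. Here is a sketch on made-up scores; with the real data you would run the same groupby on df_query:

```python
import pandas as pd

# Invented review scores: superhosts clustered near the top,
# normal hosts more spread out
demo = pd.DataFrame({
    "host_is_superhost": [True] * 4 + [False] * 4,
    "review_scores_rating": [4.8, 4.9, 5.0, 4.9, 3.0, 4.0, 5.0, 4.5],
})

# mean tells us about the peak, std about the spread of each group
stats = demo.groupby("host_is_superhost")["review_scores_rating"].agg(["mean", "std"])
print(stats)
# the superhost group shows a smaller std (narrower distribution)
```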
Now we will compare the response rate and the acceptance rate between normal hosts and superhosts.
query = """
SELECT host_id, host_name, host_response_rate, host_acceptance_rate, host_is_superhost
FROM host_info
WHERE host_response_rate IS NOT NULL and host_acceptance_rate IS NOT NULL
"""
df_query = pd.read_sql(query, conn)
df_query.head()
# Transform superhost as Boolean/Logical
df_query.host_is_superhost = df_query.host_is_superhost.astype('bool')
for i in ['host_response_rate', 'host_acceptance_rate']:
    df_query[i] = list(map(lambda x: float(re.sub("[%]", "", x)), df_query[i]))
df_query.info()
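The loop over the two rate columns can also be written with pandas' vectorized string methods instead of map/re; a sketch on made-up values:

```python
import pandas as pd

# Made-up rate strings, as they come out of the database
rates = pd.DataFrame({"host_response_rate": ["100%", "95%", "80%"]})

# rstrip('%') drops the trailing percent sign, astype(float) converts the rest
rates["host_response_rate"] = rates["host_response_rate"].str.rstrip('%').astype(float)
print(rates)
```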
Let's see the distribution of the response rate and acceptance rate.
fig = px.histogram(df_query, x='host_response_rate', color='host_is_superhost',
title='Host Response Rate',
labels={
'host_response_rate' : 'Host Response Rate',
'count' : 'Count'
})
fig.show()
fig = px.histogram(df_query, x='host_acceptance_rate', color='host_is_superhost',
title='Host Acceptance Rate',
labels={
'host_acceptance_rate' : 'Host Acceptance Rate',
'count' : 'Count'
})
fig.show()
Based on the plots above, the distributions of response rate and acceptance rate are very similar, indicating no significant difference between a normal host and a superhost.
When you host on Airbnb, you may be asked to provide information such as your legal name, date of birth, or government ID so it can be verified. This is a mandatory step for a host to prevent fraud and other financial crimes. Let's dive into which kinds of information are most often provided and verified by hosts.
query = """
SELECT host_id, host_name, host_verifications, host_identity_verified
FROM host_info
WHERE host_identity_verified IS NOT NULL
"""
df_query = pd.read_sql(query, conn)
df_query.host_identity_verified = df_query.host_identity_verified.astype('bool')
df_query
Let's check if there are any hosts that are not verified yet.
df_query[ df_query.host_identity_verified == False]
There are many hosts yet to be verified. Let's also check whether any host has a missing value, i.e. no host_verifications at all.
df_query[ df_query.host_verifications.isna() ]
Let's check the most commonly verified identity.
# Remove missing verification
df_query = df_query[df_query.host_verifications.notna()]
df_query.reset_index(drop=True, inplace=True)
list_verify = []
for i in range(df_query.shape[0]):
    verify_i = df_query.host_verifications[i].split(", ")
    list_verify.extend(verify_i)
df_verify = pd.DataFrame(list_verify, columns=["verified_info"])
df_verify
# Create frequency dataframe
# (explicit column names, so this works across pandas versions)
df_viz = df_verify['verified_info'].value_counts().rename_axis('index').reset_index(name='freq')
#Create ratio column
df_viz['ratio'] = df_viz.freq/df_query.shape[0]*100
#Get and sort the first 15 rows
df_viz = df_viz.head(15)
df_viz.sort_values('ratio', inplace = True)
df_viz
fig = px.bar(df_viz, x = 'ratio', y='index',
orientation='h',
title='Verified Information',
labels={
'ratio' : 'Percentage of Host',
'index' : 'Verified Info'
})
fig.show()
Phone number and email are the most common verification methods, completed by more than 80% of all hosts, followed by the host's government ID.
We have done some data analysis to better understand the listings and hosts of Airbnb in Bangkok. We have seen the most common room type, whether room price correlates with the review score, who the top hosts are, and more. You can go further by building an analytics dashboard in Tableau or with Python libraries such as Plotly Dash, Flask, or Streamlit.
Don't forget to close your database connection if you are finished.
# Close Connection
conn.close()